Fault Tolerance in Networks of Bounded Degree

Authors

  • Cynthia Dwork
  • David Peleg
  • Nicholas Pippenger
  • Eli Upfal
Abstract

Achieving processor cooperation in the presence of faults is a major problem in distributed systems. Popular paradigms such as Byzantine agreement have been studied principally in the context of a complete network. Indeed, Dolev [J. Algorithms, 3 (1982), pp. 14-30] and Hadzilacos [Issues of Fault Tolerance in Concurrent Computations, Ph.D. thesis, Harvard University, Cambridge, MA, 1984] have shown that Ω(t) connectivity is necessary if the requirement is that all nonfaulty processors decide unanimously, where t is the number of faults to be tolerated. We believe that in foreseeable technologies the number of faults will grow with the size of the network while the degree will remain practically fixed. We therefore raise the question whether it is possible to avoid the connectivity requirements by slightly lowering our expectations. In many practical situations we may be willing to "lose" some correct processors and settle for cooperation between the vast majority of the processors. Thus motivated, we present a general simulation technique by which vertices (processors) in almost any network of bounded degree can simulate an algorithm designed for the complete network. The simulation has the property that although some correct processors may be cut off from the majority of the network by faulty processors, the vast majority of the correct processors will be able to communicate among themselves undisturbed by the (arbitrary) behavior of the faulty nodes. We define a new paradigm for distributed computing, almost-everywhere agreement, in which we require only that almost all correct processors reach consensus. Unlike the traditional Byzantine agreement problem, almost-everywhere agreement can be solved on networks of bounded degree. Specifically, we can simulate any sufficiently resilient Byzantine agreement algorithm on a network of bounded degree using our communication scheme described above.
Although we "lose" some correct processors, effectively treating them as faulty, the vast majority of correct processors decide on a common value.

Key words. fault tolerance, communication, bounded-degree network, expander graph

AMS(MOS) subject classifications. 68M10, 68M15, 68R10

1. Preliminaries. In 1982 Dolev [D] published the following damning result for distributed computing: "Byzantine agreement is achievable only if the number of faulty processors in the system is less than one-half of the connectivity of the system’s network." Even in the absence of malicious failures, connectivity t + 1 is required to achieve agreement in the presence of t faulty processors [H]. The results are viewed as damning because of the fundamental nature of the Byzantine agreement problem. In this problem each processor begins with an initial value drawn from some domain V of possible values. At some point during the computation, during which processors repeatedly exchange messages and perform local computations, each processor must irreversibly decide on a value, subject to two conditions. No two correct processors may decide on different values, and if all correct processors begin with the same value v, then v must be the common decision value. (See [F] for a survey of related problems.)

The ability to achieve this type of coordination is important in a wide range of applications, such as database management, fault-tolerant analysis of sensor readings, and coordinated control of multiple agents.

A simple corollary of the results of Dolev and Hadzilacos is that in order for a system to be able to reach Byzantine agreement in the presence of up to t faulty processors, every processor must be directly connected to at least Ω(t) others. Such high connectivity, while feasible in a small system, cannot be implemented at reasonable cost in a large system. As technology improves, increasingly large distributed systems and parallel computers will be constructed. However, in any forthcoming technology, the number of ...

* Received by the editors June 17, 1986; accepted for publication (in revised form) November 3, 1987.
† IBM Almaden Research Center, San Jose, California 95120-6099.
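
The two connectivity bounds and the two agreement conditions quoted above can be written out as a small check. The following is only an illustrative sketch, not the paper's simulation technique: the function names, the dictionary-based representation of processors, and the `max_lost` parameter (a stand-in for "almost all correct processors") are assumptions introduced here. In particular, carrying the validity condition over unchanged to the relaxed setting is an assumption, since the text above does not spell out that detail.

```python
from collections import Counter


def max_faults_for_unanimity(connectivity, malicious=True):
    """Largest t that the quoted necessary conditions do not rule out.

    malicious=True : Dolev's bound, t < connectivity / 2.
    malicious=False: Hadzilacos's bound, connectivity >= t + 1.
    These are necessary conditions only; meeting them does not by
    itself guarantee that agreement is achievable.
    """
    if connectivity < 0:
        raise ValueError("connectivity must be non-negative")
    if malicious:
        return max(0, (connectivity - 1) // 2)
    return max(0, connectivity - 1)


def agreement_holds(initial, decided, faulty, max_lost=0):
    """Check the agreement and validity conditions over correct processors.

    initial, decided: dicts mapping processor id -> value; every correct
        processor is assumed to appear in both.
    faulty: set of faulty processor ids (their values are ignored).
    max_lost: hypothetical number of correct processors allowed to
        disagree with the majority; 0 gives the classical conditions.
    """
    correct = [p for p in initial if p not in faulty]
    if not correct:
        return True
    # Most common decision among the correct processors.
    value, count = Counter(decided[p] for p in correct).most_common(1)[0]
    # Agreement: at most max_lost correct processors decide otherwise.
    if len(correct) - count > max_lost:
        return False
    # Validity (assumed unchanged in the relaxed setting): a unanimous
    # initial value among correct processors must be the common decision.
    starts = {initial[p] for p in correct}
    if len(starts) == 1 and value not in starts:
        return False
    return True
```

Because the vertex connectivity of a graph never exceeds its minimum degree, calling `max_faults_for_unanimity(d)` on the degree `d` of a bounded-degree network gives a ceiling on the tolerable number of faults that does not grow with the network size, which is the obstacle motivating the relaxed requirement.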

Similar resources

On the Fault Tolerance of Some Popular Bounded-Degree Networks

In this paper, we analyze the fault tolerance of several bounded-degree networks that are commonly used for parallel computation. Among other things, we show that an N-node butterfly network containing N^(1-ε) worst-case faults (for any constant ε > 0) can emulate a fault-free butterfly of the same size with only constant slowdown. The same result is proved for the shuffle-exchange network. Hence, these...


Hyper Butterfly Network: A Scalable Optimally Fault Tolerant Architecture

Bounded-degree networks like de Bruijn graphs or wrapped butterfly networks are very important from a VLSI implementation point of view, as well as for applications where the computing nodes in the interconnection networks can have only a fixed number of I/O ports. One basic drawback of these networks is that they cannot provide a desired level of fault tolerance because of the bounded degree of th...


FDMG: Fault detection method by using genetic algorithm in clustered wireless sensor networks

Wireless sensor networks (WSNs) consist of a large number of sensor nodes which are capable of sensing different environmental phenomena and sending the collected data to the base station or Sink. Since sensor nodes are made of cheap components and are deployed in remote and uncontrolled environments, they are prone to failure; thus, maintaining a network with its proper functions even when und...


Communication in Networks with Random Dependent Faults

The aim of this paper is to study communication in networks where nodes fail in a random dependent way. In order to capture fault dependencies, we introduce the neighborhood fault model, where damaging events, called spots, occur randomly and independently with probability p at nodes of a network, and cause faults in the given node and all of its neighbors. Faults at distance at most 2 become d...


A new fixed degree regular network for parallel processing

We propose a family of regular Cayley network graphs of degree three based on permutation groups for design of massively parallel systems. These graphs are shown to be based on the shuffle exchange operations, to have logarithmic diameter in the number of vertices, and to be maximally fault tolerant. We investigate different algebraic properties of these networks (including fault tolerance) and...


Journal title:
  • SIAM J. Comput.

Volume 17  Issue

Pages  -

Publication date 1988